Introduction

A basic task in analysing RNA-seq count data is to detect differentially expressed genes (DEGs). A gene is differentially expressed if the change in the read counts between two experimental conditions is statistically significant. It is possible to perform differential expression analysis with the DESeq2 package. Its differential expression tests are based on a negative binomial generalised linear model.

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## âś” ggplot2 3.3.6     âś” purrr   0.3.4
## âś” tibble  3.1.8     âś” dplyr   1.0.9
## âś” tidyr   1.2.0     âś” stringr 1.4.0
## âś” readr   2.1.2     âś” forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag()    masks stats::lag()
## Loading required package: S4Vectors
## 
## Loading required package: stats4
## 
## Loading required package: BiocGenerics
## 
## 
## Attaching package: 'BiocGenerics'
## 
## 
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## 
## 
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## 
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## 
## 
## 
## Attaching package: 'S4Vectors'
## 
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## 
## 
## The following object is masked from 'package:tidyr':
## 
##     expand
## 
## 
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## 
## Loading required package: IRanges
## 
## 
## Attaching package: 'IRanges'
## 
## 
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## 
## 
## The following object is masked from 'package:purrr':
## 
##     reduce
## 
## 
## The following object is masked from 'package:grDevices':
## 
##     windows
## 
## 
## Loading required package: GenomicRanges
## 
## Loading required package: GenomeInfoDb
## 
## Loading required package: SummarizedExperiment
## 
## Loading required package: MatrixGenerics
## 
## Loading required package: matrixStats
## 
## 
## Attaching package: 'matrixStats'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     count
## 
## 
## 
## Attaching package: 'MatrixGenerics'
## 
## 
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## 
## 
## Loading required package: Biobase
## 
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## 
## 
## Attaching package: 'Biobase'
## 
## 
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## 
## 
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
## 
## 
## Loading required package: ggrepel

DESeq2

Importing Data

The first step is importing data. An RNA-seq count data should be a non-normalised sequence read counts at either the gene or transcript level presented as a table. The rows are the detected genes, and the columns are the number of sequence fragments assigned to each gene for each sample.

count_data <- read_delim(file = "data/count_data.txt") # import count data
## Rows: 24341 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): tracking_id, locus
## dbl (8): 3_Leukemic_FPKM, 5_Leukemic_FPKM, 6_Leukemic_FPKM, 7_Leukemic_FPKM,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
count_data <- count_data[, -6]
colnames(count_data)[3:9] <- gsub("_FPKM", "", colnames(count_data)[3:9])
colnames(count_data)[3:9] <- gsub("-", "_", colnames(count_data)[3:9])

Make a DESeqDataSet Object

The next step is to create an object of class DESeqDataSet, which will store the counts and intermediate calculations we need for the differential expression analysis. We use DESeqDataSetFromMatrix(couvst_dfata, colData, design) for this aim. The first argument is the coun_data. The input should be a matrix, including integer values. Let’s take a look at our count data.

str(count_data)
## tibble [24,341 Ă— 9] (S3: tbl_df/tbl/data.frame)
##  $ tracking_id    : chr [1:24341] "0610005C13Rik" "0610007P14Rik" "0610009B22Rik" "0610009L18Rik" ...
##  $ locus          : chr [1:24341] "chr7:45567794-45589710" "chr12:85815454-85824545" "chr11:51685384-51688634" "chr11:120348677-120351190" ...
##  $ 3_Leukemic     : num [1:24341] 0.01 31.07 14.6 1.07 32.44 ...
##  $ 5_Leukemic     : num [1:24341] 0.0898 37.7451 14.9858 1.8318 31.9144 ...
##  $ 6_Leukemic     : num [1:24341] 0.01 30.57 12.16 1.89 23.98 ...
##  $ 9_pre_leukemic : num [1:24341] 0.01 31.821 11.763 0.655 30.516 ...
##  $ 8_pre_leukemic : num [1:24341] 0.01 34.45 15.24 1.13 29.91 ...
##  $ 10_pre_leukemic: num [1:24341] 0.01 33.28 12.11 1.06 32.7 ...
##  $ 11_pre_leukemic: num [1:24341] 0.01 27.26 14.64 1.06 33.12 ...

The first two columns are characters. Other columns are numeric. And it is a data frame. So, we need to convert it to a matrix. But we have mentioned in the previous session that the data type of matrix values should be the same. We need to convert them to integer values. First, we remove character columns and then we round the numeric values.

# adding gene names as rownames
count_data <- tibble::column_to_rownames(count_data, var = "tracking_id")
head(count_data)
##                                   locus 3_Leukemic 5_Leukemic 6_Leukemic
## 0610005C13Rik    chr7:45567794-45589710     0.0100  0.0898211     0.0100
## 0610007P14Rik   chr12:85815454-85824545    31.0667 37.7451000    30.5707
## 0610009B22Rik   chr11:51685384-51688634    14.6030 14.9858000    12.1555
## 0610009L18Rik chr11:120348677-120351190     1.0670  1.8318000     1.8923
## 0610009O20Rik   chr18:38238404-38262629    32.4397 31.9144000    23.9829
## 0610010B08Rik  chr2:175185164-175338212     0.0100  0.0100000     0.0100
##               9_pre_leukemic 8_pre_leukemic 10_pre_leukemic 11_pre_leukemic
## 0610005C13Rik       0.010000        0.01000         0.01000         0.01000
## 0610007P14Rik      31.820800       34.45450        33.27710        27.26090
## 0610009B22Rik      11.762600       15.23670        12.10740        14.63560
## 0610009L18Rik       0.655317        1.12876         1.06295         1.06027
## 0610009O20Rik      30.516100       29.90590        32.69850        33.12450
## 0610010B08Rik       0.010000        0.01000         0.01000         0.01000
# removing the character column
count_data <- count_data[, -1]
# convert the count_data to a matrix
count_data_mat <- as.matrix(count_data)
# round the values to get rid of the decimal part
count_data_rounded <- round(count_data_mat, digits = 0)
head(count_data_rounded)
##               3_Leukemic 5_Leukemic 6_Leukemic 9_pre_leukemic 8_pre_leukemic
## 0610005C13Rik          0          0          0              0              0
## 0610007P14Rik         31         38         31             32             34
## 0610009B22Rik         15         15         12             12             15
## 0610009L18Rik          1          2          2              1              1
## 0610009O20Rik         32         32         24             31             30
## 0610010B08Rik          0          0          0              0              0
##               10_pre_leukemic 11_pre_leukemic
## 0610005C13Rik               0               0
## 0610007P14Rik              33              27
## 0610009B22Rik              12              15
## 0610009L18Rik               1               1
## 0610009O20Rik              33              33
## 0610010B08Rik               0               0

The second argument is colData. A colData is a data frame containing metadata about each sample. Each column should contain categorical values. It means you need to convert them to factor. The priority of the levels of factor variable is important. Here, pre_leukemic should be the reference. If we had other samples types like normal, the priority should be this sequence: normal, pre-leukemic, leukemic.

# creating colData
colnames(count_data) # check column names
## [1] "3_Leukemic"      "5_Leukemic"      "6_Leukemic"      "9_pre_leukemic" 
## [5] "8_pre_leukemic"  "10_pre_leukemic" "11_pre_leukemic"
# the first column vector
sample_type <- c(rep("leukemic", 3), rep("pre_leukemic", 4))
sample_type <- factor(sample_type, levels = c("pre_leukemic", "leukemic"))
# the second column vector 
gender <- c(rep(c("female", "male"), times = 3), "female")
gender <- factor(gender, levels = c("female", "male"))
colData <- data.frame(sample_type = sample_type,
                      gender = gender,
                      row.names = colnames(count_data))
colData
##                  sample_type gender
## 3_Leukemic          leukemic female
## 5_Leukemic          leukemic   male
## 6_Leukemic          leukemic female
## 9_pre_leukemic  pre_leukemic   male
## 8_pre_leukemic  pre_leukemic female
## 10_pre_leukemic pre_leukemic   male
## 11_pre_leukemic pre_leukemic female

The third argument is design. The DESeqDataSet object store the design formula used to estimate dispersion and log2 fold changes used within the model. Specifying the formula should take the form of a “~” followed by “+” separating factors.

dds <- DESeqDataSetFromMatrix(countData = count_data_rounded,
                              colData = colData,
                              design = ~ sample_type + gender)
## converting counts to integer mode
# View(dds)

You can see that dds stores the count data in assays, the design formula in the design, and the colData in the colData. It is impossible to access the count data directly from a DESeqDataSet object. We need to use assay().

head(assay(dds))
##               3_Leukemic 5_Leukemic 6_Leukemic 9_pre_leukemic 8_pre_leukemic
## 0610005C13Rik          0          0          0              0              0
## 0610007P14Rik         31         38         31             32             34
## 0610009B22Rik         15         15         12             12             15
## 0610009L18Rik          1          2          2              1              1
## 0610009O20Rik         32         32         24             31             30
## 0610010B08Rik          0          0          0              0              0
##               10_pre_leukemic 11_pre_leukemic
## 0610005C13Rik               0               0
## 0610007P14Rik              33              27
## 0610009B22Rik              12              15
## 0610009L18Rik               1               1
## 0610009O20Rik              33              33
## 0610010B08Rik               0               0

Differential Expression Analysis

Pre-filtering

Genes with very low counts don’t lead to reliable differential expression results. We need to do some light pre-filtering. This will reduce the size of the DEseq2DataSet object and speed up the algorithm’s runtime.

Here we are filtering the genes with a sum total of 10 reads on all samples.

nrow(dds)
## [1] 24341
dds <- dds[rowSums(assay(dds)) > 10, ]
nrow(dds)
## [1] 10486

The DESeq()

We use DESeq() to perform differential expression analysis. It performs some steps, including an outlier removal procedure. This function performs a default analysis through these below steps:

  1. estimation of size factors
  2. estimation of dispersion
  3. Negative Binomial GLM fitting and Wald statistics
dds <- DESeq(object = dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
dds
## class: DESeqDataSet 
## dim: 10486 7 
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(10486): 0610007P14Rik 0610009B22Rik ... a l7Rn6
## rowData names(26): baseMean baseVar ... deviance maxCooks
## colnames(7): 3_Leukemic 5_Leukemic ... 10_pre_leukemic 11_pre_leukemic
## colData names(3): sample_type gender sizeFactor

Shrinkage the Results

The estimates of differences between sample groups calculated by DESeq() are not corrected for expression level. When counts are small, it leads to large differences between sample groups. We can fix this by applying a “shrinkage” procedure.

For this aim, we use lfcShrink(), but first we need to know the name and/or the position of the “coefficient”^1 that is calculated by DESeq(). We can use resultsNames(). The resultsNames() returns the names of the model’s estimated effects (coefficients).

resultsNames(dds)
## [1] "Intercept"                           
## [2] "sample_type_leukemic_vs_pre_leukemic"
## [3] "gender_male_vs_female"

We are interested in comparing pre-leukemic and leukemic samples. It is the second element.

shrunken_res <- lfcShrink(dds = dds,
                          # the coefficient we want to reestimate
                          coef = 2, 
                          # We will use `ashr` for estimation
                          type = "ashr")
## using 'ashr' for LFC shrinkage. If used in published research, please cite:
##     Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
##     https://doi.org/10.1093/biostatistics/kxw041
class(shrunken_res)
## [1] "DESeqResults"
## attr(,"package")
## [1] "DESeq2"
head(shrunken_res)
## log2 fold change (MMSE): sample type leukemic vs pre leukemic 
## Wald test p-value: sample type leukemic vs pre leukemic 
## DataFrame with 6 rows and 5 columns
##                baseMean log2FoldChange     lfcSE    pvalue      padj
##               <numeric>      <numeric> <numeric> <numeric> <numeric>
## 0610007P14Rik  32.15434     0.01240045  0.117411 0.7285111  0.999930
## 0610009B22Rik  13.65484     0.00355700  0.135085 0.9127776  0.999930
## 0610009O20Rik  30.55839    -0.00709896  0.117273 0.8406072  0.999930
## 0610010F05Rik   7.25915     0.01115646  0.151015 0.7077363  0.999930
## 0610010K14Rik  61.84799     0.01081064  0.105673 0.7589358  0.999930
## 0610011F06Rik  22.80726     0.12788310  0.243674 0.0326866  0.488071

Here we describe what each column means:

  • baseMean: The average of the normalised count values, divided by size factors, taken over all samples.

  • log2FoldChange: This value indicates how much the gene or transcript’s expression seems to have changed between the comparison and control groups. This value is reported on a logarithmic scale to base 2.

  • lfcSE: Gives the standard error of the log2FoldChange.

  • pvalue: P-value of the test for the gene or transcript.

  • padj: Adjusted P-value for multiple testing for the gene or transcript.

The summary() helps us to see the results briefly. You can set the alpha argument to 0.05. This is the value of the adjusted p-value.

summary(shrunken_res, alpha = 0.05)
## 
## out of 10486 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 194, 1.9%
## LFC < 0 (down)     : 40, 0.38%
## outliers [1]       : 0, 0%
## low counts [2]     : 407, 3.9%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Missing P-values

Let’s check if there is any missing adjusted p-value.

shrunken_res[is.na(shrunken_res$padj), ]
## log2 fold change (MMSE): sample type leukemic vs pre leukemic 
## Wald test p-value: sample type leukemic vs pre leukemic 
## DataFrame with 407 rows and 5 columns
##                baseMean log2FoldChange     lfcSE    pvalue      padj
##               <numeric>      <numeric> <numeric> <numeric> <numeric>
## 0610040B10Rik   1.70388    -0.01150762  0.206223  0.662990        NA
## 1110006O24Rik   1.56898     0.01088935  0.207915  0.679145        NA
## 1110020A21Rik   1.56561     0.01369499  0.212707  0.608400        NA
## 1500009L16Rik   1.70209    -0.03435321  0.281815  0.277075        NA
## 1500015A07Rik   2.13119    -0.00206597  0.184789  0.936103        NA
## ...                 ...            ...       ...       ...       ...
## Zfp97           1.56632     0.01361175  0.210612  0.610362        NA
## Zkscan16        1.84455    -0.02299480  0.238024  0.421149        NA
## Zkscan7         1.84410    -0.02371225  0.240433  0.409602        NA
## Zscan12         1.56424     0.00325981  0.199363  0.898953        NA
## Zxdb            1.84835    -0.00396092  0.191899  0.877713        NA

Why do we have genes with no adjusted p-value?

Adjusted p-values can be set to NA values if all samples have zero counts within a row, if a row contains a sample with an extreme count outlier, or if a row is filtered because it has a low mean normalised count.

Significant Genes

The genes with adjusted p-value < 0.05 are statistically significant. We are going to select them.

shrunk_res_sig <- subset(shrunken_res, shrunken_res$padj <= 0.05)

Exploring the Results

Volcano Plot

p <- EnhancedVolcano(shrunken_res,
                lab = NA,
                x = 'log2FoldChange', y = 'padj',
                FCcutoff = 1, pCutoff = 0.05,
                title = 'Leukemic vs Pre-leukemic',
                legendPosition = "top")
ggsave(filename = "plots/not_labeled_volcano.jpg",
       plot = p,
       dpi = 600,
       units = "in",
       height = 8, 
       width = 8)

We can name the genes and add a theme layer to the plot.

p <- EnhancedVolcano(shrunken_res,
                lab = rownames(shrunken_res),
                labSize = 3,
                x = 'log2FoldChange', y = 'padj',
                FCcutoff = 1, pCutoff = 0.05,
                title = 'Leukemic vs Pre-leukemic') +
  theme_get()

ggsave(filename = "plots/labeled_volcano.jpg",
       plot = p,
       dpi = 600,
       units = "in",
       height = 5, 
       width = 8)

Heatmap

Before visualising count data, we need to transform it. The variance stabilising transformation or VST is the most accurate and fastest way.

vst_obj <- vst(dds, blind = FALSE)
vst_df <- as.data.frame(assay(vst_obj))

Static

We will use pheatmap() from the pheatmap package to plot the static heatmap. We will visualise the top 30 genes based on the adjusted p-value.

# sort shrunken_res by adjusted p-value ascending
shrunken_res <- shrunken_res[order(shrunken_res$padj),]

# select the top 15 up-regulated
top_15_up <- subset(shrunken_res, shrunken_res$padj <= 0.05 & shrunken_res$log2FoldChange > 0)
top_15_up <- top_15_up[1:15, ]
# select the top 15 down-regulated
top_15_down <- subset(shrunken_res, shrunken_res$padj <= 0.05 & shrunken_res$log2FoldChange < 0)
top_15_down <- top_15_down[1:15, ]

# bind them by row

top_30 <- rbind(top_15_up, top_15_down)

# select top 30 from vst_df
top_30 <- subset(vst_df, rownames(vst_df) %in% rownames(top_30))

p <- pheatmap(top_30,
         scale = "row",
         cluster_cols=TRUE, cluster_rows = TRUE, treeheight_row = 10,
         show_rownames = TRUE,
         main = "Expression of Top 30 Genes")
# ggsave helps us to save our plots in jpg, pdf, png, and other types of images format
ggsave(filename = "plots/static_heatmap.jpg",
       plot = p,
       dpi = 600,
       units = "in",
       height = 8, 
       width = 5)

Let’s change the colour of the heatmap. Here are some of the colour pallets in the RColorBrewer package.

display.brewer.all(colorblindFriendly = TRUE)

# We use RdBu
col_pal <- colorRampPalette(rev(brewer.pal(9, "RdBu")))(200)
# see the result
plot(1:length(col_pal), 1:length(col_pal), col = col_pal, cex = 5, pch=19)

We selected the new colour for the plot. Let’s add an annotation bar for the samples.

p <- pheatmap(top_30,
              col=col_pal, 
              scale = "row",
              cluster_cols=TRUE, cluster_rows = TRUE, treeheight_row = 10,
              show_rownames = TRUE,
              annotation_col = colData,  annotation_names_col = TRUE,
              main = "Expression of Top 30 Genes")
ggsave(filename = "plots/static_annotated_heatmap.jpg",
       plot = p,
       dpi = 600,
       units = "in",
       height = 8, 
       width = 5)

Interactive

To plot the interactive heatmap, we use d3heatmap() from the d3heatmap package. First, we need to cluster the rows and columns.

# clustering
clustRows <- hclust(as.dist(1-cor(t(top_30), method="pearson")), method="single") 
clustColumns <- hclust(as.dist(1-cor(top_30, method="pearson")), method="single")
plot(as.dendrogram(clustColumns))

module.assign <- cutree(clustRows, k = 10)

# plotting

p <- d3heatmap(top_30,
               colors = col_pal,
               Rowv=as.dendrogram(clustRows),
               Colv = as.dendrogram(clustColumns),
               scale='row')
# sevewidget() from htmlwidgets helps us to save HTML files
htmlwidgets::saveWidget(p, 
             file= "plots/interactive_heatmap.html")
p

Principal Component Analysis (PCA)

PCA is an unsupervised method, similar to clustering, to find patterns. It simplifies the complexity of high-dimensional data. You can perform PCA with prcomp().

# we perform PCA based on all of the genes
# Use plotPCA, but return the data for custom plotting

p <- plotPCA(vst_obj,
             ntop = nrow(vst_df), 
             intgroup = "sample_type")
ggsave(filename = "plots/pca_not_labeled.jpg",
       plot = p,
       dpi = 600)
## Saving 7 x 5 in image

Let’s add labels on the points and title.

p <- plotPCA(vst_obj,
             ntop = nrow(vst_df),
             intgroup = "sample_type") +
  geom_text(aes(label = colnames(vst_obj)),
            hjust = "right",vjust = "right", angle = 45) +
  labs(title = "PCA") +
  xlim(-9, 9) + ylim(-9, 9) # it is important to see pca plots in a square panel

ggsave(filename = "plots/pca_labeled.jpg",
       plot = p,
       dpi = 600)
## Saving 7 x 5 in image

Acknowledgement

I would like to thank my supervisor Dr Gregor Reid, Dr Ali Farrokhi, and Tanmaya Atre, for their support and guidance. Also, I would like to thank Reid’s lab members and all of the people who helped me to get through my project.

Resources

  1. https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/

  2. https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#theory

  3. https://htmlpreview.github.io/?https://github.com/AlexsLemonade/training-modules/blob/2021-september/RNA-seq/06-openpbta_heatmap.nb.html